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16 Abstract 

17 Protein-protein interactions are fundamental to many biological processes. Experimental 

18 screens have identified tens of thousands of interactions and structural biology has provided 

19 detailed functional insight for select 3D protein complexes. An alternative rich source of 

20 information about protein interactions is the evolutionary sequence record. Building on 

21 earlier work, we show that analysis of correlated evolutionary sequence changes across 

22 proteins identifies residues that are close in space with sufficient accuracy to determine the 

23 three-dimensional structure of the protein complexes. We evaluate prediction performance 

24 in blinded tests on 76 complexes of known 3D structure, predict protein-protein contacts in 

25 32 complexes of unknown structure, and demonstrate how evolutionary couplings can be 

26 used to distinguish between interacting and non-interacting protein pairs in a large complex. 

27 With the current growth of sequence databases, we expect that the method can be 

28 generalized to genome-wide elucidation of protein-protein interaction networks and used 

29 for interaction predictions at residue resolution. 
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30 Introduction 

31 A large part of biological research is concerned with the identity, dynamics and specificity of 

32 protein interactions. There have been impressive advances in the three-dimensional (3D) 

33 structure determination of protein complexes which has been significantly extended by 

34 homology-inferred 3D models 1 ' 2 ' 3,4 . However, there is still little, or no, 3D information 

35 for ~80% of the currently known protein interactions in bacteria, yeast or human, amounting 

36 to at least ~30,000/~6000 incompletely characterized interactions in human and E. coli, 

37 respectively 2,5 . With the rapid rise in our knowledge of genetic variation at the sequence 

38 level, there is increased interest in linking sequence changes to changes in molecular 

39 interactions, but current experimental methods cannot match the increase in the demand 

40 for residue-level information of these interactions. One way to address the knowledge gap 

41 of protein interactions has been the use of hybrid, computational-experimental approaches 

42 that typically combine 3D structural information at varying resolutions, homology models 

43 and other methods 6 , with force field-based approaches such as Rosetta Dock, residue cross- 

44 linking and data-driven approaches that incorporate various sources of biological 

45 information 1,7 16 . However, most of these approaches depend on the availability of prior 

46 knowledge and many biologically relevant systems remain out of reach, as additional 

47 experimental information is sparse (e.g. membrane proteins, transient interactions and large 

48 complexes). One promising computational approach is to use evolutionary analysis of amino 

49 acid co-variation to identify close residue contacts across protein interactions, which was 

50 first used 20 years ago 17 ' 18 , and subsequently used also to identify protein interactions 19,2 °. 

51 Others have used some evolutionary information to improve a machine learning approach to 

52 developing docking potentials 21 " 23 . These previous approaches relied on a local model of co- 

53 evolution that is less likely to disentangle indirect and therefore incorrect correlations from 

54 the direct co-evolution, as has been described in work on residue-residue interactions in 

55 single proteins 24 . More recently, reports using a global model have been successful in 

56 identifying residue interactions from evolutionary covariation, for instance between 

57 histidine kinases and response regulators 25 " 27 , and this approach has only recently been 

58 generalized and used to predict contacts between proteins in complexes of unknown 

59 structure, in an independent effort parallel to this work 28 . In principle, just a small number of 

60 key residue-residue contacts across a protein interface would allow computation of 3D 

61 models and provide a powerful, orthogonal approach to experiments. 

62 Since the recent demonstration of the use of evolutionary couplings (ECs) between residues 

63 to determine the 3D structure of individual proteins 29 " 33 , including integral membrane 

64 proteins 34 ' 35 , we reason that an evolutionary statistical approach such as EVcouplings 29 

65 could be used to determine co-evolved residues between proteins. To assess this hypothesis 

66 we built an evaluation set based on all known binary protein interactions in E. coli that have 

67 3D structures of the complex as recently summarized 5 . We develop a score for every 
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68 predicted inter-protein residue pair based on the overall inter-protein EC score distributions 

69 resulting in accurate predictions for the majority of top ranked /'/iter-protein EC pairs (inter- 

70 ECs) and sufficient to calculate accurate 3D models of the complexes in the docked subset, 

71 Figure 1A. This approach was then used to predict evolutionary couplings for 32 complexes 

72 of unknown 3D structures that have sufficient number of sequences, including previously 

73 published experimental support for our predicted unknown interactions between the a-, b- 

74 and c-subunit of ATP synthase. 

75 Results 

76 We first investigated whether co-evolving residues between proteins are close in three 

77 dimensions by assessing blinded predictions of residue co-evolution against experimentally 

78 determined 3D complex structures. We follow this evaluation by then predicting co-evolved 

79 residue pairs of interacting proteins that have no known complex structure. 

80 Extension of the evolutionary couplings method to protein complexes 

81 To compute co-evolution across proteins, individual protein sequences must be aligned 

82 paired up with each other that are presumed to interact, or being tested to see if they 

83 interact. Without this condition, proteins could be paired together that do not in fact 

84 interact with each other and therefore detection of co-evolution would be compromised. 

85 Given that the evolutionary couplings method depends on large numbers of diverse 

86 sequences 34 , some assumption must be made about which proteins interact with each other 

87 in homologous sequences in other species. Since it is challenging to know a priori whether 

88 particular interactions are conserved across many millions of years in thousands of different 

89 organisms, we use proximity of the two interacting partners on the genome as a proxy for 

90 this, with the goal of reducing incorrect pairings. 

91 To assemble the broadest possible data sets to test the approach and make predictions we 

92 take all known interacting proteins assembled in a published dataset that contains ~3500 

93 high-confidence protein interactions in E. coli 5 . After removing redundancy and requiring 

94 close genome distance between the pairs of proteins this results in 326 interactions, see 

95 Materials and methods (Figure IB, Figure 1 - figure supplement 1, Supplementary file 1 and 

96 2), 

97 The paired sequences are concatenated and statistical co-evolution analysis is performed 

98 using EVcouplings 29 ' 30 ' 32 / that applies a pseudolikelihood maximization (PLM) approximation 

99 to determine the interaction parameters in the underlying maximum entropy probability 

100 model 33,36 , simultaneously generating both intra- and inter-EC scores for all pairs of residues 

101 within and across the protein pairs (Figure 1A). Evolutionary coupling calculations in 

102 previous work have indicated that this global probability model approach requires a 

103 minimum number of sequences in the alignment with at least 1 non-redundant sequence 
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104 per residue " ' ' . Our current approach allows complexes with fewer available sequences 

105 to be assessed (minimum at 0.3 non-redundant sequences per residue) by using a new 

106 quality assessment score to assess the likelihood of the predicted contacts to be correct. The 

107 EVcomplex score is based on the knowledge that most pairs of residues are not coupled and 

108 true pair couplings are outliers in the high-scoring tail of the distribution (see Materials and 

109 methods, Figure 2A and 2B, Figure 2 - figure supplement 1 and 2). The score can intuitively 

110 be understood as the distance from the noisy background of non-significant pair scores, 

111 normalized by the number of non-redundant sequences and the length of the protein 

112 (Materials and methods, equations 1 and 2). If the number of sequences per residue is not 

113 controlled for, there is a large bias in in the results, overestimating performance with low 

114 numbers of sequences (Figure 2B and 2C). The precise functional form of the correction for 

115 low numbers of sequences was chosen non-blindly after observing the dependencies in the 

116 test set. 

117 Blinded prediction of known complexes 

118 Evolutionary covariation reveals inter-protein contacts. Of the 329 interactions identified 

119 that are close on the E. coli genome, 76 have a sufficient number of alignable homologous 

120 sequences and known 3D structures either in E. coli or in other species. This set was used to 

121 test the inter-protein evolutionary coupling predictions (Supplementary file 1). The 

122 relationship between the EVcomplex score and the precision of the corresponding inter- 

123 protein ECs suggests that on average 74% (69%) of the predicted pairs with EVcomplex score 

124 greater than 0.8 will be accurate to within 10A (8A) of an experimental structure of the 

125 complex (Figure 2C). Most complexes have at least one inter-protein predicted contact 

126 above the selected score threshold of 0.8 (53/76 complexes). Three complexes have more 

127 than 20 predicted inter-protein residue contacts which are over 80% accurate, namely the 

128 histidine kinase and response regulator system (78 residue pairs), t-RNA synthetase (32 

129 residue pairs and the vitamin B importer complex (21 residue pairs), with precision over 80% 

130 (complex numbers 330, 019, 130 respectively, Figure 2D, Figure 2 - figure supplements 3-8, 

131 Supplementary file 1). 

132 We suggest that users of EVcomplex consider predicted contacts that lie below the 

133 threshold of 0.8 in the context of other biological knowledge, where available, or in 

134 comparison to other higher scoring contacts for the same complex. In this way additional 

135 true positive inter-residue contacts can be distinguished from false positives. For instance, 

136 the ethanolamine ammonia-lyase complex (complex 065) has only 3 predicted inter-protein 

137 residue pairs above the score threshold, but in fact has 5 additional correct pairs with 

138 EVcomplex scores slightly below the threshold of 0.8 which cluster with the 3 high-scoring 

139 contacts on the monomers, indicating that they are also correct. 



4 



Downloaded from http://biorxiv.org/on September 18, 2014 



140 Some of the high confidence inter-protein ECs in the test set are not close in 3D space when 

141 compared to their known 3D structures. These false positives may be a result of assumptions 

142 in the method that are not always correct. This includes (1) the assumption that the 

143 interaction between paired proteins is conserved across species and across paralogs, and (2) 

144 that truly co-evolved residues across proteins are indeed always close in 3D, which is not 

145 always the case. In addition, the complexes may also exist in alternative conformations that 

146 have not necessarily all been captured yet by crystal or NMR structures, for instance in the 

147 case of the large conformational changes of the BtuCDF complex 37 . 

148 Docking is accurate with few pairs of predicted contacts. To test whether the computed 

149 inter-protein ECs are sufficient for obtaining accurate 3D structures of the whole complex, 

150 we selected 15 diverse examples (with 5 or more inter-protein residue contacts) for docking 

151 [Table 1, Figure 3, Figure 3 -figure supplement 1, Supplementary file 3) with HADDOCK 14,38 . 

152 The docking procedure is fast and generates 100 3D models of each complex using all 

153 residue pairs with EVcomplex scores above the selection threshold. We additionally dock 

154 negative controls to assess the amount of information added to the docking protocol by 

155 evolutionary couplings (500 models per run, no constraints other than center of mass, see 

156 Materials and methods). The best models for all 15 complexes docked with evolutionary 

157 couplings have interface RMSDs under 6 A, 12/15 have the best scoring model under 4A and 

158 the top ranked models for 11/15 are under 5A backbone interface RMSD compared to a 

159 crystal or NMR structure interface. Over 70% of the generated models are close to the 

160 experimental structures of the complexes (< 4A backbone iRMSD), compared to less than 

161 0.5% in the controls (and these were not high -ranked) (Figure 3 - figure supplement 1, 

162 Supplementary file 3, Supplementary data) Not surprisingly complexes that have the largest 

163 numbers of true positive predicted contacts perform the best when docking. For example, 

164 the ribosomal proteins RS3 and RS14 have 11 true positive inter-protein ECs and result in a 

165 top ranked model only 1.1 A iRMSD from the reference structure. More surprisingly, other 

166 complexes with a lower proportion of true positive inter-protein contacts, such as Ubiquinol 

167 oxidase (6 out of 11) or the epsilon and gamma subunits of ATP synthase (8 out of 15) also 

168 produced accurate predicted complexes, with an iRMSD of 1.8 and 1.4 A respectively. The 

169 docking experiments therefore demonstrate that inter-protein ECs, even in the presence of 

170 incorrect predictions, can be sufficient to give accurate 3D models of protein complexes, but 

171 more work will be needed to quantify the likelihood of successful docking from the 

172 predicted contacts. 

173 Conserved residue networks provide evidence of functional constraints. The top 10 inter- 

174 EC pairs between Metl and MetN are accurate to within 8A in the MetNl complex (PDB: 3tui 

175 39 ), resulting in an average 1.4 A iRMSD from the crystal structure for all 100 computed 3D 

176 models (Table 1, Supplementary file 3 and Supplementary data). The top 3 inter-EC residue 

177 pairs (K136-E108, A128-L105, and E74-R124, Metl-MetN respectively) constitute a residue 
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178 network coupling the ATP binding pocket of MetN to the membrane transporter Metl. This 

179 network calculated from the sequence alignment corresponds to residues identified 

180 experimentally that couple ATP hydrolysis to the open and closed conformations of the Metl 

181 dimer 39 (Figure 4A). The vitamin B12 transporter (BtuC) belongs to a different structural 

182 class of ABC transporters, but also uses ATP hydrolysis via an interacting ATPase (BtuD). The 

183 top 5 inter-ECs co-locate the L-loop of BtuC close to the Q-loop ATP-binding domain of the 

184 ATPase, hence coupling the transporter with the ATP hydrolysis state in an analogous way to 

185 Met I -MetN. The identification of these coupled residues across the different subunits 

186 suggests that EVcomplex identifies not only residues close in space, but also particular pairs 

187 that are constrained by the transporter function of these complexes 39,4 °. 

188 The ATP synthase £ and y subunit complex provides a challenge to our approach, since the £ 

189 subunit can take different positions relative to the y subunit, executing the auto-inhibition of 

190 the enzyme by dramatic conformational changes 41 . In a real-world scenario, where we 

191 might not know this a priori, there may be conflicting constraints in the evolutionary record 

192 corresponding to the different positions of the flexible portion of £ subunit. EVcomplex 

193 accurately predicts 6 of the top 10 inter-EC pairs (within 8A in the crystal structure lfsO 42 or 

194 3oaa 41 ), with the top 2 inter-ECs £A45-yL215 and £A40-yL207 providing contact between the 

195 subunits along an inter-protein beta sheet. The location of the C-terminal helices of the £ 

196 subunit is significantly different across 3 crystal structures (PDB IDs: lfsO 42 , laqt 43 , 3oaa 41 ). 

197 The top ranked intra-ECs support the conformation seen in laqt, with the C-terminal helices 

198 packed in an antiparallel manner and tucked against the N-terminal beta barrel (Figure 4B, 

199 green circles) and do not contain a high ranked evolutionary trace for the extended helical 

200 contact to the y subunit seen in lfsO or 3oaa (Figure 4B, grey box). Docking with the top 

201 inter-ECs results in models with 1.4 A backbone iRMSDs to the crystal structure for the 

202 interface between the N-terminal domain of the £ subunit and the y subunit (Table 1, 

203 Supplementary file 4). eD82 and yR222 connect the £-subunit via a network of 3 high-scoring 

204 intra-ECs between the N- and C-terminal helices to the core of the Fl ATP synthase. In 

205 summary, these examples suggest that inter-protein evolutionary couplings can provide 

206 residue relationships across the proteins that could aid identification of functional coupling 

207 pathways, in addition to obtaining 3D models of the complex. 

208 De novo prediction of unknown complexes. 

209 Prediction of interactions for 32 protein pairs with high-scoring evolutionary couplings. A 

210 total of 82 protein complexes with unknown 3D structure of the interaction that satisfy the 

211 conditions for the current approach, i.e. have sufficient sequences and are close in all 

212 genomes, were predicted using EVcomplex (all residue - residue inter protein evolutionary 

213 couplings scores are available in Supplementary data). 32 of these have high EVcomplex 

214 scores with at least one predicted contact (Figure 5, Figure 5 - figure supplement 1 and 2, 

215 and Supplementary file 4). Analysis of the inter-EC predictions for known 3D complex 
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216 structures shows that protein pairs with more high-scoring ECs (EVcomplex score > 0.8) have 

217 a higher proportion of true positives (Figure 2D). Hence, the protein complexes in the set of 

218 unknown structures with more high-scoring inter-ECs are the most likely to have predicted 

219 ECs that indicate residue pairs close in 3D (column Q, Supplementary file 2, the exact pairs 

220 can be found in Supplementary file 4). Three examples of predictions with multiple high- 

221 scoring inter-ECs include MetQ-Metl, UmuD-UmuC and DinJ-YafQ. The top 15 inter-ECs 

222 between MetQ and Metl are from one interface of MetQ to the Metl periplasmic loops, or 

223 the periplasmic end of the helices, consistent with the known binding of MetQ to Metl in the 

224 periplasm. 

225 The UmuD and UmuC complex is induced in the stress/SOS response facilitating the cleavage 

226 of UmuD to UmuD' (between C24 and G25) to form UmuD' 2 which then interacts with UmuC 

227 (DNA polymerase V) in order to copy damaged DNA 44 . The truncated dimer form (UmuD' 2 ) 

228 has at least two contrasting conformations where the N-terminal arm is placed on opposite 

229 sides of the dimer in one conformation or in close proximity in the alternative (Figure 5 - 

230 figure supplement 3). For 6/7 ECs above the score threshold, residues in UmuD predicted to 

231 interface with UmuC are co-located on one face of the dimer. Two residues (Y33, 138) are 

232 located in the N-terminal arm of UmuD that, after cleavage of the 24 N-terminal amino acids, 

233 may become available for binding UmuC. Since UmuD switches functions after this cleavage 

234 and can then bind UmuC, these inter ECs may identify the critical residues for translesion 

235 synthesis function 44 . Although the ECs from this UmuD arm to UmuC involve residues in two 

236 separate domains of UmuC (S 415 and Y 74), intra-monomer evolutionary couplings predict 

237 that these residues are close in UmuC (Figure 5 - figure supplement 3A, black rectangles). 

238 The relative positions of the contacting residues within each monomer therefore support 

239 the plausibility of the accuracy of the interaction interface. 

240 Whilst this manuscript was in review, the 3D structure of the previously unsolved biofilm 

241 toxin/antitoxin DinJ-YafQ complex was published (PDB: 3mlo 45 ), showing the intertwining of 

242 subunits in a heterotetrameric complex. 17/19 predicted EC residue pairs are within 8 A in 

243 this 3D structure (Supplementary file 4 and Supplementary data). In general, the agreement 

244 between our de novo predicted inter-protein ECs with available experimental data serves as 

245 a measure of confidence for the predicted residue pair interactions, and suggests that 

246 EVcomplex can be used to reveal 3D structural details of yet unsolved protein complexes 

247 given sufficient evolutionary information. 

248 EVcomplex predicts interacting protein pairs in a large complex. To investigate whether the 

249 EVcomplex score can also distinguish between interacting and non-interacting pairs of 

250 proteins, we use the E. coli ATP synthase complex as a test case. The ATP synthase structure 

251 is of wide biological interest (reviewed in 46 ) with a remarkable 3D structural arrangement, 

252 but completion of all aspects of the 3D structure has remained experimentally challenging 47 
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253 (Figure 6A). As a demonstration exercise, we calculated evolutionary couplings for all 28 

254 possible pair combinations of different ATP synthase subunits (centered around the E. coli 

255 ATP synthase) and transformed the ECs into EVcomplex scores for all inter-protein residue 

256 pairs (experimentally determined stoichiometry: a 3 P3Y6£ab 2 Cio, Supplementary file 5 and 

257 Supplementary data). Using the default EVcomplex score threshold of 0.8 to discriminate 

258 between interacting and non-interacting pairs of subunits, 24 of the 28 possible interactions 

259 between the subunits are correctly classified as interacting or non-interacting. The four 

260 incorrect predictions (namely: £ and c, y and c, £ and p, b and p, for which there is some 

261 experimental evidence) are not identified as interacting using the 0.8 EVcomplex threshold. 

262 Choosing a threshold lower than 0.8 does identify 2 of these as interacting but also 

263 introduces new false positives. The £ and |3 interaction in the crystal structure 3oaa 41 is a 

264 special case in that it involves a highly extended conformation of the last two helices of the £ 

265 subunit that reach up into the enzyme making contacts with the (3 subunit. The false 

266 negative EVcomplex score for this pair could be a result of the transience of their interaction 

267 or reflect a more general problem of lack of conservation of this interaction across the 

268 aligned proteins from different species. In total 80% of the interacting residue pairs in the 

269 known 3D structure parts of the synthase complex (7 pairs of subunits) are correctly 

270 predicted (threshold: 10A minimum atom distance between two residues). This exercise of 

271 prediction of presence or absence of interaction between any two proteins indicates the 

272 potential of the EVcomplex method in helping elucidate protein-protein interaction 

273 networks from evolutionary sequence co-variation and identify interacting subunits of large 

274 macromolecular complexes. 

275 EVcomplex predicts details of subunit interactions in ATP synthase. While much of the 3D 

276 structure of ATP synthase is known 46 , the details of interactions between the a- b-, and c- 

277 subunits have not yet been determined by crystallography. We analyse the details in these 

278 interactions, as the EVcomplex scores between these subunits are substantial (Figure 6B). 

279 We are fortunately able to provide a missing piece for this analysis, the unknown structure 

280 of the membrane-integral penta-helical a-subunit, using our previously described method 

281 for de novo 3D structure prediction of alpha-helical transmembrane proteins 34 . To our 

282 knowledge there are no experimentally determined atomic resolution structures of the a- 

283 subunit of ATP synthase. A 3D model of the a-subunit is from 1999 (lcl7 48 ) and was 

284 computed using five helical-helical interactions that were inferred from second suppressor 

285 mutation experiments, and then imposed as distance restraints for TMH2-5, revealing a four 

286 helical bundle (with no information for TMH1). Later, cross-linking experiments 49 identified 

287 contacting residues from all pairs of helical combinations of TM2-TM5 (6 pairs), supporting 

288 the earlier 4 helical bundle topology. 7 of the 8 cross-linked pairs are either exactly the same 

289 pair (L120-I246) or adjacent to many pairs in the top L intra a-subunit evolutionary couplings 

290 (ECs). 
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291 In fact, the helix packing arrangement in the predicted structure of the a-subunit is 

292 consistent with the topology suggested on the basis of crosslinking studies 50 ~ 52 , including the 

293 lack of contacts for transmembrane helix 1 with the other 4 helices (Supplementary data). 

294 The top inter-protein EC pair between subunits a and b, aK74-bE34, coincides with 

295 experimental crosslinking evidence of the interaction of aK74 with the b-subunit and the 

296 position of E34 of the b subunit emerging from the membrane on the cytoplasmic side 50,51 . 

297 Indeed, 6 of the 13 high score ECs are in the same region as the experimental crosslinks, for 

298 instance between the cytoplasmic loop between the first two helices of the a-subunit and 

299 the b-subunit helix as it emerges from the membrane bilayer 53 , a239V in TM helix 5 and 

300 bL16 (Figure 6C, Figure 6 -figure supplement 1, Supplementary file 6). Additionally, the top 

301 EC between the a- and c-subunits (aG213 - cM65) lies close to the functionally critical 

302 aR210-cD61 interaction 54 on the same helical faces of the respective subunits (Figure 6C). 

303 This prediction of missing aspects of subunit interactions may help in the design of targeted 

304 experiments to complete the understanding of the intricate molecular mechanism of the 

305 ATP synthase complex. 

306 Discussion 

307 A primary limitation of our current approach is its dependence on the availability of a large 

308 number of evolutionarily related sequences. If a protein interaction is conserved across 

309 enough sequenced genomes, using a single pair per genome can give accurate predictions of 

310 the interacting residues. However, if the protein pair is present in limited taxonomic 

311 branches, there may be insufficient sequences at any given time to make confident 

312 predictions. A solution to this could be to include multiple paralogs of the interacting 

313 proteins from each genome, but this requires correct pairing of the interaction partners, 

314 which is in general hard to ascertain. In addition, details of interactions may have diverged 

315 for paralogous pairs. Hence, in this current version of the method we have imposed a 

316 genome distance requirement across all genomes for all homolog pairs in order to be less 

317 sensitive to these complications. 

318 As the need to use genome proximity to pair sequences becomes less important with the 

319 increasing availability of genome sequences, there will be a dramatic increase in the number 

320 of interactions that can be inferred from evolutionary couplings, including those unique to 

321 eukaryotes. With currently available sequences (May 2014 release of the UniProt database), 

322 EVcomplex is able to provide information for about l/10 th of the known 3000 protein 

323 interactions in the E. coli genome. Once there are ~10,000 bacterial genome sequences of 

324 sufficient diversity, one would have enough information to test each potentially interacting 

325 pair of homologs for evidence of interaction and, given sufficiently strong evolutionary 

326 couplings, infer the 3D structure of each protein-protein pair, as well as of complexes with 

327 more than two proteins. For any set of species, e.g., vertebrates or mammals, one can 
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328 imagine guiding sequencing efforts to optimize species diversity to facilitate the extraction 

329 of evolutionary couplings. This can open the doors for more comprehensive and more rapid 

330 determination of approximate 3D structures of proteins and protein complexes, as well as 

331 for the elucidation in molecular detail of the most strongly evolutionarily constrained 

332 interactions, pointing to functional interactions. 

333 Determining the three-dimensional models of complexes from the predicted contacts was 

334 successful in many of the tested instances. Using minimal computing resources and a small 

335 number of inter-EC-derived contacts, low interface positional RMSDs relative to 

336 experimental structures can be achieved. However, a significant number of proteins exist as 

337 homomultimers within larger complexes. To determine models of these complexes one 

338 must deconvolute homomultimeric inter-ECs from the intra-protein signal, which is an 

339 important technical challenge for future work. 

340 The analysis of subunit interactions in ATP synthase in this work is a "proof of principle" 

341 study showing that methods such as EVcomplex can determine which proteins interact with 

342 each other at the same time as specific residue pair couplings across the proteins (as also 

343 shown in the work by the Baker lab on ribosomal protein interactions 28 ). Understanding the 

344 networks of protein interactions is of critical interest in eukaryotic systems, such as 

345 networks of protein kinases, GPCRs, or PDZ domain proteins. An understanding of the 

346 distributions of interaction specificities is of high interest to many fields. Although we do not 

347 know how well our evolutionary coupling approach will handle less obligate interactions, 

348 results on the two-component signalling system (histidine kinase/response regulator) both 

349 here and in other work 25 ' 26 suggest optimism. 

350 The approximately scale-free EVcomplex score is a heuristic based on the distribution of raw 

351 EC scores from the statistical model, their dependence on sequence alignment depth and 

352 the length of the concatenated sequences. The score provides a simple way of accounting 

353 for these dependencies such that a uniform threshold, say 0.8, can be used for any protein 

354 pair with the expectation of reasonably accurate predictions. Since cutoff thresholds can be 

355 useful but overly sharp, we recommend investigating predicted contacts below the 

356 threshold used in this work, especially where there is independent biological knowledge to 

357 validate the predictions. 

358 The work presented here is in anticipation of a genome-wide exploration and, as a proof of 

359 principle, shows the accurate prediction of inter-protein contacts in many cases and their 

360 utility for the computation of 3D structures across diverse complex interfaces. As with single 

361 protein (intra-EC) predictions, evolutionarily conserved conformational flexibility and 

362 oligomerization can result in more than one set of contacts that must be de-convoluted. Can 

363 evolutionary information help to predict the details and extent for each complex? A key 

364 challenge will be the development of algorithms that can disentangle evolutionary signals 
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365 caused by alternative conformations of single complexes, alternative conformations of 

366 homologous complexes, and effectively deal with false positive signals. Taken together, 

367 these issues highlight fruitful areas for future development of evolutionary coupling 

368 methods. 

369 Despite conditions for the successful de novo calculation of co-evolved residues, the method 

370 described here may accelerate the exploration of the protein-protein interaction world and 

371 the determination of protein complexes on a genome-wide scale at residue level resolution. 

372 The use of co-evolutionary analysis in computational models to determine protein specificity 

373 and promiscuity, co-evolutionary dynamics and functional drift will open up exciting future 

374 research questions. 

375 
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376 Materials and methods 

377 Selection of interacting protein pairs for co-evolution calculation. 

378 The candidate set of complexes for testing and de novo prediction was derived starting from 

379 a dataset of binary protein-protein interactions in E. coli including yeast two-hybrid 

380 experiments, literature-curated interactions and 3D complex structures in the PDB 5 . Three 

381 complexes not contained in the list were added based on our analysis of other subunits in 

382 the same complex, namely BtuC/BtuF, Metl/MetQ, and the interaction between ATP 

383 synthase subunits a and b. Since our algorithm for concatenating multiple sequence pairs 

384 per species assumes the proximity of the interacting proteins on the respective genomes of 

385 each species (see below), we excluded any complex with a gene distance > 20 from further 

386 analysis. The gene distance is calculated as the number of genes between the interacting 

387 partners based on an ordered list of genes in the E. coli genome obtained from the UniProt 

388 database. The resulting list of pairs (~ 350) was then filtered for pseudo-homomultimeric 

389 complexes based on the identification of Pfam domains in the interacting proteins (330). All 

390 remaining complexes with a known 3D structure (as summarized in 5 ) or a homologous 

391 interacting 3D structure (93) (identified by intersecting the results of HMMER searches 

392 against the PDB for both monomers) were used for evaluating the method, while complexes 

393 without known structure (236) were assigned to the de novo prediction set (Figure 1 -figure 

394 supplement 1). The set with protein complexes of known 3D structure was further filtered 

395 for structures that only cover fragments (< 30 amino acids) of one or both of the monomers 

396 and structures with very low resolution (> 5A), which led to the re-assignment of 

397 Ribonucleoside-diphosphate reductase 1 (complex_002), Type I restriction-modification 

398 enzyme EcoKI (complex_012), RpoC/RpoB (complex_041), RL11/RI7 (complex_165), the 

399 ribosome with SecY (complex_226, complex_250, and complex_255), and RS3/RS 

400 (complex_254) to the set of unknown complexes. Large proteins were run with the specific 

401 interacting domains informed by the known 3D structure, when the full sequence was too 

402 large for the number of retrieved sequences, (for domain annotation see Supplementary 

403 data.) 

404 This set could serve as a benchmark set for future development efforts in the community. 

405 Multiple sequence alignments. 

406 Each protein from all pairs in our dataset was used to generate a multiple sequence 

407 alignment (MSA) using jackhmmer 55 to search the UniProt database 56 with 5 iterations. To 

408 obtain alignments of consistent evolutionary depths across all the proteins, a bit score 

409 threshold of 0.5 * monomer sequence length was chosen as homolog inclusion criterion (- 

410 incdomT parameter), rather than a fixed f-value threshold which selects for different 

411 degrees of evolutionary divergence based on the length of the input sequence. 

412 In order to calculate co-evolved residues across different proteins, the interacting pairs of 

413 sequences in each species need to be matched. Here, we assume that proteins in close 
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414 proximity on the genome, e.g., on the same operon, are more likely to interact, as in the 

415 methods used previously matching histidine kinase and response regulator interacting pairs 

416 25,26 (Supplementary data). We retrieved the genomic locations of proteins in the alignments 

417 and concatenated pairs following 2 rules: (i) The CDS of each concatenated protein pair must 

418 be located on the same genomic contig (using ENA 57 for mapping), and (ii) each pair must be 

419 the closest to one another on the genome, when compared to all other possible pairings in 

420 the same species. The concatenated sequence pairs were filtered based on the distribution 

421 of genomic distances to exclude outlier pairs with high genomic distances of more than 10k 

422 nucleotides (Supplementary data). Alignment members were clustered together and 

423 reweighted if 80% or more of their residues were identical (thus implicitly removing 

424 duplicate sequences from the alignment). Supplementary file 1 and 2 report the total 

425 number of concatenated sequences, the lengths, and the effective number of sequences 

426 remaining after down-weighting in the evaluation and de novo prediction set, respectively. 

427 Computation of evolutionary couplings. 

428 Inter- and intra-ECs were calculated on the alignment of concatenated sequences using a 

429 global probability model of sequence co-evolution, adapted from the method for single 

430 proteins ' ' using a pseudo-likelihood maximization (PLM) rather than mean field 

431 approximation to calculate the coupling parameters. Columns in the alignment that contain 

432 more than 80% gaps were excluded and the weight of each sequence was adjusted to 

433 represent its cluster size in the alignment thus reducing the influence of identical or near- 

434 identical sequences in the calculation. For the evaluation set we can then compare the 

435 predicted ECs for both within and between the protein/domains to the crystal structures of 

436 the complexes (for contact maps and all EC scores, see Supplementary data). 

437 Definition of a scale-free score for the assessment of interactions. 

438 In order to estimate the accuracy of the EC prediction we evaluate the calculated inter-ECs 

439 based on the following observations: (1) most pairs of positions in an alignment are not 

440 coupled, i.e. have an EC score close to zero, and tend to be distant in the 3D structure; (2) 

441 the background distribution of EC scores between non-coupled positions is approximately 

442 symmetric around a zero mean; and (3) higher-scoring positive score outliers capture 3D 

443 proximity more accurately than lower-scoring outliers (see also Figure 2). The width of the 

444 (symmetric) background EC score distribution can be approximated using the absolute value 

445 of the minimal inter-EC score. The more a positive EC score exceeds the noise level of 

446 background coupling, the more likely it is to reflect true co-evolution between the coupled 

447 sites. For each inter-protein pair of sites i and j with pair coupling strength EC inte r(i, j), we 

448 therefore calculate a raw reliability score ('pair coupling score ratio', Figure 2B) defined by 
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EC (i i) 

449 QZ T (i, 3) = (1) 



min [EC. t [i, j 

i inter » ' ^ 



450 Since the accuracy of evolutionary couplings critically depends both on the number and 

451 diversity of sequences in the input alignment and the size of the statistical inference 

452 problem 29 31 we incorporate a normalization factor to make the raw reliability score 

453 comparable across different protein pairs. The normalized EVcomplex score is defined as 



^raw 

454 EVcomplex-Score(z, j) = — '— (2) 



1 + 



\ — 

2 



455 where N e ff is the effective number of sequences in the alignment after redundancy reduction, 

456 and L (total number of residues) is the length of the concatenated alignment. Previous work 

457 on single proteins has shown that the method requires a sufficient number of sequences in 

458 the alignment to be statistically meaningful. We thus filter for sequence sufficiency requiring 

459 Neft/L > 0.3 (Table 1, Supplementary files 1 and 2). Predictions of coupled residues in the 

460 evaluation set were evaluated against their residue distances in known structures of protein 

461 pairs 5 (see Supplementary file 7) in order to determine the precision of the method. 

462 To interpret the EVcomplex prediction of interaction between subunits a and b of the ATP 

463 synthase as well as UmuC and UmuD, individual monomer models were built de novo for the 

464 structurally unsolved subunit-a of ATP synthase and UmuC using the EVfold pipeline as 

465 previously published 29,34 . In both cases coupling parameters were calculated using PLM 36 

466 and sequences were clustered and weighted at 90% sequence identity (the resulting models 

467 are provided in Supplementary data). 

468 Prediction of interactions in a set of subunits. 

469 Following this same protocol EVcomplex scores were calculated for all possible 28 

470 combinations of the 8 E. coli ATP synthase F 0 and Fi subunits. Since we want to compare the 

471 computational predictions to some 'ground truth', as with the complexes for the rest of the 

472 manuscript, we used known 3D structures of the ATP synthase complex to assign whether or 

473 not the subunits interact (3oaa, lfsO, 2a7u Supplementary file 7). Since we are also 

474 determining whether the subunits interact, not necessarily knowing full atomic detail 

475 residue interactions, we included subunit interactions that have been inferred from cryo-EM, 

476 crosslinking or other experiments, but do not necessarily have a crystal structure. These are 

477 represented as solid blue boxes, if the interaction is well established 53,58 " 60 , or crosshatched 

478 blue if there is a lack of consensus in the community, left panel Figure 6B. 
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479 For each possible interaction the EVcomplex score of the highest ranked inter-EC was 

480 considered as a proxy for the likelihood of interaction. Pairs with scores above 0.8 are 

481 considered likely to interact, between 0.75 and 0.8 weakly predicted, while interactions with 

482 scores below 0.75 are rejected as possible complexes, blue boxes, blue crosshatched and 

483 white respectively, right panel Figure 6B and Supplementary data. 

484 Computation of 3D structure of complexes. 

485 A diverse set of 15 complexes was chosen from the 22 in the evaluation set that had at least 

486 5 couplings above a complex score of 0.8 and were subsequently docked (Supplementary file 

487 3). Proteins that have been crystallized together in a complex could bias the results of the 

488 docking, as they have complementary positions of the surface side chains. Therefore, where 

489 possible we used complexes that had a solved 3D structure of the unbound monomer, 

490 namely GcsH/GcsT, CyoA, FimC, DhaL, AtpE, PtqA/PtqB, RS10 and HK/RR, and in all other 

491 cases the side chains of the monomers were randomized either by using SCWRL4 61 or 

492 restrained minimization with Schrodinger Protein Preparation Wizard 62 before docking. For 

493 ubiquinol oxidase (complex_054) the unbound structure of subunit 2 (CyoA) only covers the 

494 COX2 domain. In this case docking was performed using this unbound structure plus an 

495 additional run using the bound complex structure with perturbed side chains. 

496 We used HADDOCK 14 , a widely used docking program based on ARIA 63 and the CNS 

497 software 64 (Crystallography and NMR System), to dock the monomers for each protein pair 

498 with all inter-ECs with an EVcomplex score of 0.8 or above implemented as distance 

499 restraints on the a-carbon atoms of the backbone. 

500 Each docking calculation starts with a rigid-body energy minimization, followed by semi- 

501 flexible refinement in torsion angle space, and ends with further refinement of the models in 

502 explicit solvent (water). 500/100/100 models generated for each of the 3 steps, respectively. 

503 All other parameters were left as the default values in the HADDOCK protocol. Each protein 

504 complex was run using predicted ECs as unambiguous distance restraints on the Cot atoms 

505 (d e ff 5A, upper bound 2A, lower bound 2A; input files available in Supplementary data). As a 

506 negative control, each protein complex was also docked using center of mass restraints (ab 

507 initio docking mode of HADDOCK) 38 alone and in the case of the controls generating 

508 10000/500/500 models. 

509 Each of the generated models is scored using a weighted sum of electrostatic (E e iec) and van 

510 der Waals (E vdw ) energies complemented by an empirical desolvation energy term (E de soiv) 65 - 

511 The distance restraint energy term was explicitly removed from the equation in the last 

512 iteration (Edist3 = 0.0) to enable comparison of the scores between the runs that used a 

513 different number of ECs as distance restraints. 
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5 14 Comparison of predicted to experimental structures. 

515 All computed models in the docked set were compared to the cognate crystal structures by 

516 the RMSD of all backbone atoms at the interface of the complex using ProFit v. 3.1 

517 (http://www.bioinf.org.uk/software/profit/). The interface is defined as the set of all 

518 residues that contain any atom < 6 A away from any atom of the complex partner. For the 

519 AtpE-AtpG complex we excluded the 2 C-terminal helices of AtpE as these helices are mobile 

520 and take many different positions relative to other ATP synthase subunits 41 . Similarly, since 

521 the DHp domain of histidine kinases can take different positions relative to the CA domain, 

522 the HK-RR complex was compared over the interface between the DHp domain alone and 

523 the response regulator partner. In the case of the unbound ubiquinol oxidase docking results, 

524 only the interface between C0X2 in subunit 2 and subunit 1 was considered. Accuracy of the 

525 computed models with EC restraints were compared with computed models with center of 

526 mass restraints alone (negative controls), Figure 3 -figure supplement 1, Supplementary file 

527 3). 

528 Data analysis was conducted primarily using IPython notebooks 66 . A webserver and all data 

529 is made EVcomplex.org. 
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Table Legends 



Table 1. EVcomplex predictions and docking results for 15 protein complexes 

EVcomplex contacts Docking quality (iRMSD) 



Complex Name 



Subunits 



b TP Top ranked Best 
rate c model d model 6 



Seqs 



Carbamoyl-phosphate synthase CarB:CarA 



2.3 



17 0.88 



1.9 



1.9 



Aminomethyltransferase/ GcsH:GcsT 
Glycine cleavage system H protein 



2.9 



0.2 



5.4 



5.4 



Histidine kinase/ 
response regulator 



KdpD:CheY 
(T. maritima) 



95.4 



78 0.72 



2.1 



2.0 



Ubiquinol oxidase 



CyoB:CyoA 



Outer membrane usher protein/ FimD:FimC 
Chaperone protein 



1.0 
3.6 



11 0.55 
6 0.83 



1.8 
3.2 



1.2 
3.0 



Molybdopterin synthase MoaD:MoaE 
Methionine transporter complex MetN:Metl 



3.6 
1.9 



8 1.0 
14 0.86 



4.4 
1.5 



4.1 
1.2 



Dihydroxyacetone kinase 



DhaLDhaK 



1.4 



12 0.42 



6.7 



2.4 



Vitamin B12 uptake system 



BtuC:BtuF 



3.2 



0.6 



2.8 



2.8 



Vitamin B12 uptake system 



BtuC:BtuD 



9.8 



21 0.88 



1.1 



0.9 



ATP synthase y and e subunits AtpE:AtpG 



2.9 



15 0.53 



1.4 



1.4 



IIA-IIB complex of the N,N'- 
diacetylchitobiose (Chb) 
transporter 



PtqA:PtqB 



3.1 



0.2 



7.2 



5.5 



30 S Ribosomal proteins 



RS3:RS14 



1.4 



11 0.91 



1.1 



1.1 



Succinatequinone oxido-reductase SdhB:SdhA 
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Figure Legends 

Figure 1. Figure 1. Co-evolution of residues across protein complexes from the evolutionary 
sequence record. (A) Evolutionary pressure to maintain protein-protein interactions leads to 
the coevolution of residues between interacting proteins in a complex. By analyzing patterns 
of amino acid co-variation in an alignment of putatively interacting homologous proteins (left), 
evolutionary couplings between coevolving inter-protein residue pairs can be identified 
(middle). By defining distance restraints on these pairs, the 3D structure of the protein 
complex can be inferred using docking software (right). (B) Distribution of E. coli protein 
complexes of known and unknown 3D structure where both subunits are close on the 
bacterial genome (left), allowing sequence pair matching by genomic distance. For a subset of 
these complexes, sufficient sequence information is available for evolutionary couplings 
analysis (dark blue bars). As more genomic information is created through on-going 
sequencing efforts, larger fractions of the E. coli interactome become accessible for 
EVComplex (right). A detailed version of the workflow used to calculate all E. coli complexes 
currently for which there is currently enough sequence information is shown in Figurel - 
figure supplement 1. 

Figure 2. Evolutionary couplings capture interacting residues in protein complexes. (A) Inter- 
and Intra-EC pairs with high coupling scores largely correspond to proximal pairs in 3D, but 
only if they lie above the background level of the coupling score distribution. To estimate this 
background noise a symmetric range around 0 is considered with the width being defined by 
the minimum inter-EC score. For the protein complexes in the evaluation set this distribution 
is compared to the distance in the known 3D structure of the complex that is shown here for 
the methionine transporter complex, MetNI. (Plots for all complexes in the evaluation set are 
shown in Figure 2 - figure supplement 1 and 2). (B) A larger distance from the background 
noise (ratio of EC score over background noise line) gives more accurate contacts. 
Additionally, the higher the number of sequences in the alignment the more reliable the 
inferred coupling pairs are which then reduces the required distance from noise (different 
shades of blue). Residue pairs with an 8A minimum atom distance between the residues are 
defined as true positive contacts, and precision = TP/(TP+FP). The plot is limited to range (0,3) 
which excludes the histidine kinase - response regulator complex (HK-RR) - a single outlier 
with extremely high number of sequences. (C) To allow the comparison across protein 
complexes and to estimate the average inter-EC precision for a given score threshold 
independent of sequence numbers, the raw couplings score is normalized for the number of 
sequences in the alignment, the EVcomplex score. In this work, inter-ECs with a score > 0.8 
are used. Note: the shown figure is cut off at score of 2 in order to zoom in on the phase 
change region and the high sequence coverage outlier HK-RR is excluded. (D) For complexes 
in the benchmark set, inter-EC pairs with EVcomplex score > 0.8 give predictions of 
interacting residue pairs between the complex subunits to varying accuracy (8A TP distance 
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cutoff). All predicted interacting residues for complexes in the benchmark set that had at 
least one inter-EC above 0.8 are shown as contact maps in Figure 2 - figure supplement 3-8. 

Figure 3. Blinded prediction of evolutionary couplings between complex subunits with 
known 3D structure. Inter-ECs with EVcomplex score > 0.8 on a selection of benchmark 
complexes (monomer subunits in green and blue, inter ECs in red, pairs closer than 8A by 
solid red lines, dashed otherwise). The predicted inter-ECs for these ten complexes were then 
used to create full 3D models of the complex using protein-protein docking For the fifteen 
complexes for which also 3D structures were predicted using docking, energy funnels are 
shown in Figure 3 - figure supplement 1. 

Figure 4. Evolutionary couplings give accurate 3D structures of complexes. EVcomplex 
predictions and comparison to crystal structure for (A) the methionine-importing 
transmembrane transporter heterocomplex MetNl from E. coli (PDB: 3tui) and (B) the 
gamma/epsilon subunit interaction of E. coli ATP synthase (PDB: lfsO). Left panels: Complex 
contact map comparing predicted inter-ECs with EVcomplex score > 0.8 (red dots, upper right 
quadrant) and intra-ECs (up to the last chosen inter-EC rank; green and blue dots, top left and 
lower right triangles) to close pairs in the complex crystal (dark/mid/light grey points for 
minimum atom distance cutoffs of 5/8/12 A for inter-subunit contacts and dark/mid grey for 
5/8 A within the subunits). Inter-ECs with an EVcomplex score > 0.8 are also displayed on the 
spatially separated subunits of the complex (red lines on green and blue cartoons, couplings 
closer than 8A in solid red lines, dashed otherwise, lower left). Right panels: Superimposition 
of the top ranked model from 3D docking (green/blue cartoon, left) onto the complex crystal 
structure (grey cartoon), and close-up of the interface region with highly coupled residues 
(green/blue spheres). 

Figure 5. Evolutionary couplings in complexes of unknown 3D structure. Inter-ECs for five de 
novo prediction candidates without E. coli or interaction homolog complex 3D structure 
(Subunits: blue/green cartoons; inter-ECs with EVcouplings score > 0.8: red lines). For 
complex subunits which homomultimerize (light/dark green cartoon), inter-ECs are placed 
arbitrarily on either of the monomers to enable the identification of multiple interaction sites. 
Contact maps for all complexes with unsolved structures are provided in Figure 5 - figure 
supplement 1 and 2. Left to right: (1) the membrane subunit of methionine-importing 
transporter heterocomplex Metl (PDB: 3tui) together with its periplasmic binding protein 
MetQ (Swissmodel: P28635); (2) the large and small subunits of acetolactate synthase NvB 
(Swissmodel: P08142) and NvN (PDB: 2lvw); (3) panthotenate synthase PanC (PDB: liho) 
together with ketopantoate hydroxymethyltransferase PanB (PDB: lm3v); (4) subunits a and 
b of ATP synthase (model for a subunit a predict with EVfold-membrane, PDB: lb9u for b 
subunit), for detailed information see Figure 5; and (5) the in DNA repair and SOS mutagenisis 
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involved complex UmuC (model created with EVfold) with one possible conformation of 
UmuD (PDB: li4v). For alternative UmuD conformation, see Figure 5 - figure supplement 3. 

Figure 6. Predicted interactions between the a-, b- and c- subunits of ATP synthase. (A) The 

a- and b- subunits of E. coli ATP synthase are known to interact, but the monomer structure 
of subunits a and b and the structure of their interaction in the complex are unknown. (B) 
EVcomplex prediction (right matrix) for ATP synthase subunit interactions compared to 
experimental evidence (left matrix), which is either strong (left, solid blue squares) or 
indicative (left, crosshatched squares). Interactions that have experimental evidence, but are 
not predicted at the 0.8 threshold are indicated as yellow dots. (C) Left panel: Residue detail 
of predicted residue-residue interactions (dotted lines) between subunit a and b (residue 
numbers at the boundaries of transmembrane helices in grey). Right panel: Proposed helix- 
helix interactions between ATP synthase subunits a (green), b (blue, homodimer), and the c 
ring (grey). The proposed structural arrangement is based on analysis of the full map of inter- 
subunit ECs with a EVcomplex score larger than 0.8 (Figure 6 - figure supplement 1). 
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Figure Supplements 

Figure 1 - figure supplement 1: Details of the EVcomplex Pipeline 

Figure 2 - figure supplement 1-2: Distribution and accuracy of raw EC scores for all 
complexes in evaluation set 

Figure 2 - figure supplement 3-8: Contact maps of all complexes with solved 3D structure 
with inter-ECs above EVcomplex score of 0.8. Predicted coevolving residue pairs with an 
EVcomplex score > 0.8 and all inter-ECs up to the rank of the last include inter-EC are 
visualized in complex contact maps (red dots: inter-ECs, green and blue dots: intra-ECs for 
monomer 1 and 2, respectively). Top left and bottom right quadrants: intra-ECs; top right and 
bottom left quadrants: inter-ECs. Inter- and intra-protein crystal structure contacts at 
minimum atom distance cutoffs of 5/8/12 A are shown as dark/middle/light grey dots, 
respectively; missing data in the crystal structure as shaded blue rectangles. 

Figure 3 - figure supplement 1: Comparison of Interface RMSD to HADDOCK score. The 
HADDOCK scores of docked models are plotted against their iRMSDs to the bound complex 
crystal. Grey data points correspond to models created without any ECs as unambiguous 
restraints whereas blue dots correspond to model created using all inter-couplings with 
EVcomplex score > 0.8. HADDOCK score outliers with scores > 100 are not shown, and any 
model with an iRMSD > 35A is displayed as iRMSD=35 A for visualization purposes 

Figure 5 - figure supplement 1-2: Contact maps of all complexes without solved 3D structure 
with at least one inter-ECs above EVcomplex score of 0.8. Inter-ECs are shown as red dots in 
the top right and bottom left quadrant while intra-ECs of the two monomers are shown in 
green and blue in the top left and bottom right quadrant, respectively. 

Figure 5 - figure supplement 3: Details of the predicted UmuCD interaction residues 

Figure 6 - figure supplement 1: Contact map of predicted ECs in the ATPsynthase a and b 
subunits. Inter-ECs are shown as red dots in the top right and bottom left quadrant while 
intra-ECs of the two monomers are shown in green and blue in the top left and bottom right 
quadrant, respectively. 
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Supplementary Files 

Supplementary file 1: Benchmark dataset and results 
Supplementary file 2: Unknowns dataset and results 
Supplementary file 3: Docking results 

Supplementary file 4: Predicted inter-ECs for complexes in de novo prediction dataset with 
EVcomplex score > 0.8 

Supplementary file 5: ATPsynthase predictions 

Supplementary file 6: Comparison of ATP synthase EVcomplex predictions of a and b subunit 
with cross-linking studies 

Supplementary file 7: PDB identifiers used for comparison of predicted evolutionary couplings 
to known 3D structures 

Supplementary Data 

Supplementary data 1: Concatenated alignments for complexes predicted in this work 

Supplementary data 2: Genome distance distribution of concatenated sequences per 
alignment 

Supplementary data 3: EVcomplex predictions for evaluation and de novo set 
Supplementary data 4: Docking input files and top 10 predicted models for evaluation set 
Supplementary data 5: ATP synthase predictions, ATP synthase subunit a model 
Supplementary data 6: UmuC model 
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Figure 1 
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Figure 3 

Blinded prediction of inter-protein contacts in complexes with known 3D structure 




BtuC-BtuD DhaK-DhaL CarB-CarA GcsT-GcsH RS3-RS14 
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Figure 4 
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Figure 5 
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Figure 6 

Prediction of ihe partially known complex of ATP synthase 
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Y>t: known complex 
structure (Fig. 3b) 
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